feat: generalize ChunkManifest to hold inline chunks#938
feat: generalize ChunkManifest to hold inline chunks#938TomNicholas merged 26 commits intozarr-developers:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #938 +/- ##
==========================================
+ Coverage 89.35% 89.91% +0.56%
==========================================
Files 33 33
Lines 2038 2053 +15
==========================================
+ Hits 1821 1846 +25
+ Misses 217 207 -10
🚀 New features to boost your workflow:
|
TomNicholas
left a comment
There was a problem hiding this comment.
This is awesome, thanks @maxrjones .
My main concern about this PR is that it changes private internals and public behaviour in one go. The more neat way to do it would be to leave any changes to the Kerchunk and Icechunk readers/writers to follow-up PRs.
| _paths: np.ndarray[Any, np.dtypes.StringDType] | ||
| _offsets: np.ndarray[Any, np.dtype[np.uint64]] | ||
| _lengths: np.ndarray[Any, np.dtype[np.uint64]] | ||
| _native: dict[tuple[int, ...], bytes] |
There was a problem hiding this comment.
How is a scalar native array handled? An empty tuple? Worth having a test for that case.
There was a problem hiding this comment.
added a test in [8fc3de3 (this PR)](8fc3de3 (this PR), and yes - https://github.com/maxrjones/VirtualiZarr/blob/8fc3de361e244329b3358abdcc7ba58b3627a5f6/virtualizarr/tests/test_manifests/test_manifest.py#L520-L521
|
I'm also not totally clear on the intended relationship between this feature and the parsers. Do parsers choose if a chunk is native? 1 IIUC the user has no way to use this feature via Footnotes
|
|
Thanks for the review @TomNicholas. I addressed all your comments in the last set of commits.
Good point, I pulled out the Kerchunk reader portion. I think the writers need to know what to do with inlined data immediately, otherwise you could get buggy serialized data if someone were to write a parser that uses the inlined data feature.
That's right. Parsers chose which chunks are inlined. The user chooses which arrays are loaded via |
I mean you could just update the writers to raise
Cool. So it is really an implementation detail. It's only relevant for Parser authors. So doesn't that mean the docs for it should only be under "Writing a custom parser"? |
|
@maxrjones the changes don't look right any more - where is the new attribute on chunk manifest for holding the buffer? Also please:
|
will do |
|
Ah sorry. We also need to test other basic operations on manifests in this PR, such as concatenation, broadcasting, and any slicing we support. Might be easier to change fixtures of existing tests than to add new ones. |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broadcast should replicate inlined chunk bytes to every position along an expanded axis, matching the behaviour already observed for virtual chunks. Three of the four new tests fail under the current implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously _broadcast_manifest only prepended singleton dimensions to inlined chunk keys, leaving a single dict entry even when np.broadcast_to expanded an axis. Reads at the replicated positions would find the INLINED_CHUNK_PATH sentinel in the paths array but miss the _inlined dict, producing broken behaviour in ManifestStore.get. Now we replicate each inlined entry to every target position along any axis that was size 1 in the source, mirroring how the paths/offsets/lengths arrays are broadcast. The bytes themselves are shared by reference, not copied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the existing behaviour of _concat_manifests and _stack_manifests for manifests containing inlined chunks: keys are shifted along the concat axis or gain the stack-axis index, and bytes are shared by reference rather than copied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Confirms replicated entries share the same bytes object rather than allocating copies at each expanded position. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When two ManifestArrays share paths/offsets/lengths but have different
inlined chunk data, ManifestArray.__eq__ falls through to its 'over-cautious'
fallback via ChunkManifest.elementwise_eq, which does not currently compare
inlined bytes. That triggers RuntimeWarning('Should not be possible to get here').
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously elementwise_eq only looked at paths/offsets/lengths, which all agree for inlined chunks even when their bytes differ. That let two ChunkManifests disagree per __eq__ but look identical per elementwise_eq, tripping the 'Should not be possible to get here' branch in ManifestArray.__eq__. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the inlined-chunk branch in ManifestStore.get including byte-range variants (RangeByteRequest, OffsetByteRequest, SuffixByteRequest), a mixed manifest where inlined and virtual chunks are served from the same array, and list_dir enumeration of inlined chunk keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
|
@maxrjones I hope you don't mind but I took over this PR. I responded to my own review by making these changes:
|
this is great, thank you very much for taking this on! Those changes are all very helpful. I haven't done a line-by-line review, but would be happy with you merging whenever you feel it's ready. |
Validation previously used a subset check, which silently accepted entries
with unknown keys alongside the required path/offset/length. Now the entry
key set must match exactly one of the two valid shapes: virtual ({path,
offset, length}) or inlined ({path, offset, length, data}).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calls out the path-value convention used by ChunkManifest entries so parser authors have a single, discoverable reference for distinguishing virtual, missing, and inlined chunks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broadcasting inlined chunks not only prepends singleton dims to their keys, but also replicates the bytes (by reference) across every position of an expanded axis, per the fix in zarr-developers#938. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#981) * Add failing test for writing inlined chunks to icechunk The icechunk writer currently sends the INLINED_CHUNK_PATH sentinel ('__inlined__') straight into icechunk's set_virtual_refs_arr, which rejects it as a malformed virtual URL. The new test writes a manifest containing one inlined chunk plus one virtual chunk, commits, then re-opens via xarray and asserts the values match end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Support writing inlined ChunkManifest entries to icechunk Icechunk's set_virtual_refs_arr rejects the INLINED_CHUNK_PATH sentinel ('__inlined__') as a malformed URL. write_manifest_to_icechunk now writes inlined chunks first as native chunks via store.set, then rewrites those positions to empty strings in the paths array before calling set_virtual_refs_arr with the cleaned view. A cheap numpy-level check skips the virtual-refs call entirely for all-inlined or all-missing manifests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update broadcast description to reflect bytes replication Broadcasting inlined chunks not only prepends singleton dims to their keys, but also replicates the bytes (by reference) across every position of an expanded axis, per the fix in #938. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This implements the design proposed in #851 (comment) for supporting native chunks in ChunkManifest, which allows loading Kerchunk JSONs with in-lined references or eventually Icechunk stores.
I prefer this design over #794, but still don't really like the amount of
if nativeforks needed. Still, this is such a long-overdue feature that it seems worth releasing soon even if it's not the perfect design. We can also improve the internals later on. Let's discuss at the dev meeting tomorrow._native: dict[tuple[int, ...], bytes]toChunkManifestfor storing in-memory chunk dataChunkEntrywith optionaldata: NotRequired[bytes]fieldManifestStore.get()store.set()for icechunkAcceptance criteria:
ChunkManifestto hold native chunks as well as virtual refs #851docs/releases.md*.mdfile underdocs/api